ABSTRACT


This report describes the operation of a system built both to
model the vision of primates, including man, and to serve as a
pre-prototype of a possible object recognition system. It was
employed in a series of experiments to determine the practicability
of matching left and right images of a scene to determine the range
and form of objects.

The experiments started with computer-generated random-dot
stereograms as inputs and progressed through random-square
stereograms to a real scene. The major problems were the elimination
of spurious matches between the left and right views, and the
interpretation of ambiguous regions: those on the left side of an
object that can be viewed only by the left camera, and those on the
right side of an object that can be viewed only by the right camera.

Rules were developed for eliminating spurious matches in the 
progressively more difficult objects. An arbitrary rule was developed 
for interpretation of ambiguous regions. 

In the experiments reported, comparison of left and right views 
was performed in terms of gray values, but the comparison could be 
made in terms of edges. An economical method of detecting edges 
was demonstrated. 

Stereo camera assemblies were designed and one of them built 
to permit the cameras to converge and together pitch, roll and yaw. 

A second stereo TV camera assembly has been built which has as its 
only moving parts two mirrors and a means of focussing. The above 
experiments were performed, before either of these camera assemblies 
was available, by exposing a single-view camera in two positions.

We show that a scene on Mars, reported to earth in terms of its 
features, can be reconstructed on earth. 

Perhaps the two most useful results were (1) development of 
the concept of a match space where the detected three-dimensional
properties of a scene can be plotted and then examined for their form, 
and (2) the conclusion that a stereo TV camera is needed which will 
acquire both central and peripheral stereo pairs of images. 
I. INTRODUCTION


A. The Main Objective 

The main objective of the work reported here was to develop 
automatic means of classifying three-dimensional objects. The 
problem requiring solution at the start of this work was how to guide 
an automatic vehicle (robot) in the exploration of the surface of Mars, 
recognize objects and report these findings to earth. Problems of a 
similar nature have since arisen. One is the automatic classification 
of plankton when viewed under a binocular microscope. Another is the 
automatic classification of objects to be machined, assembled or 
otherwise manipulated in the course of manufacture. 

While classification is the goal, we have found the word "recog- 
nition" easier to use. Thus, "automatic recognition of objects" is the 
goal most often mentioned here. 

B. A Supporting Objective 

Since the only systems able to recognize three-dimensional 
objects are animals, an objective that was pursued, supporting the 
main objective, was to model animal vision. The first efforts in this 
direction were to devise models of the vision of a lower vertebrate, 
namely, the frog. The reasons for giving attention to this animal 
were, first, that it performs recognition within its eyeball, and, 
second, that the neurophysiology of the frog's eye is well understood. 
For example, a moving insect is recognized there and reported via 
the frog's optic nerve to its brain. Thus, by modelling a frog's eye, 
one devises an operating model of a recognition system. Reports on 
this part of our effort describe a first crude model of the frog's eye
(Ref. 1), a fine-grain model of the bug-detector cell in the frog's eye
(Ref. 2), a more rigorous model of this bug detector (Refs. 3-5),
and a first description of the shift register scheme (Ref. 6).

McCulloch described animal nervous systems as multiple 
loops of information flow, with computation in every loop (Ref. 7). 
Except for the frog (Ref. 8), however, it was not possible to 
describe these systems in sufficient detail to enable immediate 
progress to a useful design. What was needed was an operating 
model which could be shaped by a series of small changes. Par- 
ticularly needed was a test system which would permit modelling 
the binocular vision of primates, including man. 

C. Test Systems

Two test systems were built, both to model the vision of 
animals, including man, and serve as pre-prototypes of possible
object recognition systems. The first system is described in Ref. 9. 
The second is pictured in Fig. 0. The TV camera of this second 
system was usually aimed at the simulated Mars scene (Fig. 1), 
which consisted of rolling terrain, made of papier-mache on a 4 ft. 
by 8 ft. piece of plywood, and a painted backdrop, all created by 
Dustin Thomas. Lighting was usually unidirectional from the right. 
The Type C3 camera was occasionally aimed out the window to test 
the ability of the system to determine the range of objects on the 
roofs of adjoining buildings. The edge of a building on the Boston 
skyline was imaged to serve as an infinity point in the adjustment 
of the mirrors of the camera assembly. 

The TV camera, the central element in stereo TV, generated 
a TV image displayed 30 times a second on the monitor. The TV 
camera was also scanned slowly along vertical lines, under com- 
puter control, to acquire an image for processing. Any image 
stored either on magnetic tape or in core memory can be displayed 
and photographed on the oscilloscope, the intensity of which can be 
modulated to display gray values. 
Fig. 0. The second test system. (Diagram label: stereo TV camera.)







D. Approach 

Once this equipment was completed, we turned our attention 
to making it work. There were two examples before us. One was 
the modelling of the vertebrate visual system, one cell type at a
time, progressing from the retina through the lateral geniculate
nucleus to the visual cortex. Fukushima demonstrated that this was
possible (Refs. 10, 11). 

The other approach was that of Julesz, in which he had 
employed a computer to compare the unprocessed left and right 
images of a stereogram to extract range data, thus assuring that 
the structure of the scene would be acquired automatically. We 
decided to follow the latter route. Having done that, we now find 
that it is often desirable to introduce another stage of computation 
between the acquisition of left and right images of a scene and
comparison of these images. It might appear that we will thus 
arrive at the same result as if we had taken the first approach. In 
fact, however, by providing for the structure first we have sought 
and found economies in data handling that we might not have found 
in taking the first approach. 

E. Other Supporting Objectives

To move toward the main objective stated in A above, other 
supporting objectives had to be pursued; namely, detection of edges 
with a minimum amount of hardware, reconstruction of the appear- 
ance of a scene from detected features, design of stereo TV cameras 
and design of the mounting of one such camera on a rover.

F. How This Paper Differs from First Version 

The first printed version of this paper is Ref. 12. It is revised
here to provide a more complete introduction, to clarify the
description of the simulation program EXPER in II J, show in II N
how range accuracy can be increased over that in the example of II L,
propose another method of stereo processing in II I, expand III to
include examples of the detection of spatial frequency, describe in
more detail the development of stereo TV cameras in V, and add
comments on how this work models primate vision. The only new
illustration in Sections I and II is Fig. 0, placed ahead of all of the
previously numbered illustrations. With the addition of Fig. 19 to
Section III, all of the previously numbered figures from 19
to 29 move up one. With the addition of Fig. 30, the final figure
numbers move up two. Because many more references have been
added, the reference numbers here differ from those in the first
printed version.



Fig. 1. Mars-like scene.
Fig. 2. Two strategies of visual processing: locating,
represented by rectangles, and identifying,
represented by dashed lines. (Diagram labels: TV cameras, platform.)






II. MEANS OF AUTOMATICALLY DETERMINING THE RANGE
OF SMALL AREAS OF A SCENE

A. Objective

The immediate objective of the work reported in this section
was automatic determination of the range of small areas on the
surfaces of objects. A further objective was to employ this range
information to automatically determine the form of those objects.

The method employed in pursuing the first objective consisted
of comparing the gray values of picture elements (pixels) in a left
image of the field of view with gray values of picture elements in a
right image. Because this appeared to be the simplest possible way
of comparing left and right images, it enabled us to define concepts
basic for future work, such as match space, spurious matches and
ambiguities. For that future work, comparison of other variables
in the left and right images is described in III.

Two automatic means of determining range compete for our 
attention. One uses a laser beam that is deflected by a mechanically- 
driven mirror under computer control. The other uses a stereo TV 
camera which is both controlled and interpreted by computer. 

Because our long-term goal is automatic recognition of objects
characterized by edges and lines (features) which a laser may not be
able to detect, we pursued the development of stereo-TV-computing.
Laser range-finding can supplement stereo-TV-computing to
determine to greater accuracy the range of objects selected by
stereo-TV-computing.

B. Opto-Electro-Mechanical Strategies

In bringing a stereo pair of images onto the face of a camera
tube or tubes, several opto-electro-mechanical strategies are
possible. Figure 2 illustrates two that can be employed consecutively.
The first works on a coarse scale and is called "locating". The
second works on a fine scale (high resolution) and is called
"identifying" (Ref. 13). Following the first strategy, each camera
subtends the wide view represented by the rectangles to discover the
rock, the hole and the cliff. Following the second strategy, the two
cameras investigate details along the edge of the rock with
higher-resolution optics.

As part of the first strategy, the cameras of Fig. 2 are shown 
mounted on a table that tilts within one gimbal and turns on another. 
Each camera is also supported by a gimbal on the platform, repre- 
sented by a black dot on the camera case. The range of each object 
can be computed from the angles formed by the camera axes and the 
distances between the cameras, when both cameras are centered on 
the object. The first strategy has been pursued in the design of the 
gimballed mirrors and gimballed cameras described in Section V. 

The second strategy is to identify the features and, from the 
position of those features in three-dimensional space, the form of 
objects. The second strategy has been pursued in the work described 
in Sections II and III of this paper. In Section II the features are gray 
levels formed from the pattern of luminances in the scene. In Section 
III they are edges. 

The second strategy consists of two substrategies. The first
substrategy, called "stereopsis", requires two views and yields in 
man "the experience of relative depth only" (Ref. 14). The method 
of comparing two views, on the other hand, that we have designed, 
yields absolute depth. The second substrategy, called "cognitive 
processing", requires the storage of features and the relations 
between features so that these can be compared to features and their 
relations in the image. This second substrategy is not employed in 
the examples presented in this paper. How it could be employed is 
considered in II H. 

A third strategy, while not optical or mechanical, is electronic
in the sense of being computational. It can determine the class of
thing the stereo-TV computer looks for. This strategy was described
first in "Assembly of Computers to Command and Control a Robot"
(Ref. 15), and is being described in more detail in Refs. 16 and 17.
In those reports the word "robot" is used, as McCulloch used it, to
mean an animal or a machine (Ref. 18).

C. Automatic Comparison of One Scan Line of Left with One Scan
Line of Right TV Image

Both of the first two strategies require a method of comparing 
left and right images. The method about to be described stems from 
the work of Bela Julesz (Ref. 19), and was developed into its present 
form by the second author (Ref. 20). 

Fig. 3 shows at the left two TV cameras whose parallel optical 
axes extend into a space with coordinates x, y and z (measured in 
meters). At the lower right is a three-dimensional structure where 
a model is formed of the scene at left. It has the dimensions, shown 
in the lower right corner: x' and y' in pixels, d in integral values of 
disparity. We call the structure "match space". Disparity is related 
to range by Eq. (1) in II G. 

In our simulations of proposed hardware, match space is only 
36 values of disparity deep, from d=0, corresponding to infinity, to 
d=35. Only one x'-d plane is shown in the match space (m-space) of
Fig. 3. Note that the black squares in this plane approximate the 
shape of a section of the rock at left. Each black square represents 
a 1 in the computer memory. 

The information in this match plane comes from the scan lines 
on the face of the left and right camera tubes. The left scan line is 
an image of a V-shaped area projecting from the left camera lens 
into space, the right scan line the image of a similar V-shaped area. 
Where the two V-shaped fields overlap is the binocular field of the 
cameras.

As the electron beam in the left camera tube starts to sweep
across the line pictured on the face of that tube, the voltage of the
camera's output is converted to one of $2^5$, or 32, levels, which we
call "gray levels". Expressed as a five-bit word, this gray level enters
the first column of the left-image shift register. When the electron
beam advances to the next pixel, the first column is shifted to the
right and a new column takes its place. This process continues until
the left-image shift register is full.

After 36 five-bit words have been formed and shifted, as
described, the same process begins in the right camera tube and
right-image shift register. Thus, when the left image reaches the
end of its shift register, the right is only 36 positions behind.
Thereafter, the right marches past the left and the two are compared
after each shift.

The number of pixels by which the image of a point in the right 
image falls short of overlapping the image of the same point in the left 
image is called the disparity of that point. In Fig. 3, the effect is 
shown of comparing the left and right images when the disparity 
between left and right images of a point is 35, 34, 33, 32, 31 and 30. 
For example, when the disparity is 35, a spurious match is formed 
due to the fact that images with the same gray level are not neces- 
sarily images of the same point in the scene. When the disparity is 
34, two matches are made, one for each side of the rock. The process 
of shifting and comparing continues until, at d=30, the edges of the 
rock have been mapped. 
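
The shift-and-compare process described above can be sketched in software. This is a minimal illustration, not the report's simulation code. It assumes that a point seen at column x of the left scan line appears at column x - d of the right scan line, and the function name `match_plane`, the tolerance argument `eps` and the sample gray levels are all hypothetical.

```python
def match_plane(left, right, max_d=36, eps=0):
    """Build one x'-d plane of first-stage matches from two scan lines.

    plane[d][x] is 1 where the gray level at left[x] agrees with the gray
    level at right[x - d] to within eps (assumed sign convention).
    """
    width = len(left)
    plane = [[0] * width for _ in range(max_d)]
    for d in range(max_d):
        for x in range(d, width):
            if abs(left[x] - right[x - d]) <= eps:
                plane[d][x] = 1
    return plane


# Hypothetical scan lines: a background at d=0 and a three-pixel object
# (gray levels 5, 6, 7) at d=2. Levels 20 and 21 stand for background
# uncovered in the right view only.
left = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
right = [1, 2, 5, 6, 7, 20, 21, 8, 9, 10]
plane = match_plane(left, right, max_d=4)
```

In this example the d=0 row holds the background matches and the d=2 row holds the object. Columns 2 and 3 of the left line match nothing, because that part of the background is seen only by the left camera.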

That comparisons are made for only 36 values of disparity is 
due to the size of the memory in the computer used for the simula- 
tions. Only one form of an x'-d plane is shown in Fig. 3, namely, 
one in which successive lines of matches are "justified" to the right, 
as viewed from the direction of the camera. (The word "justify" is 
a printer's term which means to line up lines of type evenly.)
Fig. 3. Diagram of computation to form a model in
match space of an object in binocular space.
(Diagram labels: binocular space, lenses, left camera tube,
right camera tube, analog-to-digital converters, shift
registers for left and right images, comparator,
spurious matches.)



D. Methods of Mapping Binocular Space into Match Space

Fig. 4 shows a plane of binocular space divided into
quadrilaterals of minimum uncertainty, determined by the division of each
scan line into pixels. We say "minimum" because the number of
quadrilaterals will be as few as illustrated only if (1) a camera tube
is employed whose maximum resolution in TV lines approximates
the division into pixels shown here, and (2) the spatial
frequency of the high-contrast detail in the scene is that which leads to
this resolution.

The uncertainty pictured in Fig. 4 is due to the choice of focal 
length of the lens and to the characteristics of the camera tube. This 
uncertainty is optoelectronic. The uncertainty due to positioning of 
the camera is electromechanical. The need to add the two uncertain- 
ties can be obviated, at least during a fixation, by rigidly attaching 
the optoelectronics for strategy 1 to the optoelectronics for strategy 
2. The Type E1 stereo TV camera, described in V, is designed this
way. 

Fig. 4 shows the effect of projecting the pixels on the face of 
the two camera tubes out into binocular space. Because we see them 
here only in plan view, we call the intersection of two pixel rays from 
the left a "quadrilateral of uncertainty". Actually it is a polyhedron 
of uncertainty (Ref. 21).

Note that the number of quadrilaterals formed by intersecting 
rays increases as z increases. 

When the lines of matches formed in match space are justified
to the left, as shown in the lower right corner of Fig. 4, a
right-camera view is formed. This can be verified by following, first, the
left ray of the right camera, which can be seen to form a continuous
succession of quadrilaterals with rays from the left camera. Next
follow the succession of rays from the right camera which intersect
a single ray from the left camera in a manner that leads to the jagged
right edge of the right-camera view.

Fig. 4. Binocular space divided into quadrilaterals of
uncertainty by rays drawn from pixels in left
and right images. Along any ray only alternate
quadrilaterals of uncertainty are shaded.

A left-camera view and a center view are also presented in
Fig. 4. The left- and right-camera views are needed for the method
of eliminating spurious matches of areas presented in II J. In
hardware, only one model of matches will need to be formed of binocular
space. The two views can be obtained by two sets of inter-wiring.

E. How Matches Are Made and Viewed

Fig. 5 shows how first-stage matches* are made in the system
of Fig. 3. Fig. 5 pictures, from above, the right ends of the two 
shift registers at three different positions of one scan line of the 
right image. At the first position, where disparity equals 32, two 
matches are made, one of them spurious. In the second position, 
where d=31, another match is made and in the third position where 
d=30, another. 

Exact (ε=0) first-stage matches such as these can usually be
made only between computer-generated images. Between images of
a real scene, tolerances are needed at both first and second stages
of match, to allow for noise in the electronics or noise in the scene.
By the latter we mean, for example, unequal reflection of light from
the same spot in the scene to the different viewpoints of the two
cameras.

The tolerance at the first stage of match is the allowed
difference between the gray value of one pixel in the left image and
the gray value of the pixel in the right image to which it is
considered matched. We call this tolerance ε and assign it the values
shown in the second column of Table 1.

* This is analogous to "local stereopsis" in man (Ref. 22).

Fig. 5. Computation of first-stage matches at the
right ends of three lines of the x'-d plane
pictured in Fig. 3. The tolerance of first-stage
match, ε, is here 0.

The tolerance at the second stage of match is in the form of a
threshold and accompanies our requirement that an N x N area of pixels
surrounding one pixel in the left image be compared to an N x N area of
pixels surrounding a second pixel in the right image. The number
of first-stage matches required for a second-stage match is
$N^2$ (LOWLIM/100), where LOWLIM/100 is a preselected fraction
of the $N^2$ possible first-stage matches. Values of N and LOWLIM/100
used in the examples of this paper are shown in Table 1. As will be
explained in II K, the program EXPER increases the search area beyond
N to resolve "ties", i.e., surfaces of the same number of first-stage
matches, in an attempt to find the larger area.
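
The two tolerances can be illustrated with a short sketch. It is a hedged reading of the text, not the code of STEREO, STROUT or EXPER: the function names are hypothetical, the window is assumed to be centered on the pixel, and the threshold is taken as N x N x (LOWLIM/100) with "at least" read as "greater than or equal to".

```python
def first_stage_match(gray_left, gray_right, eps=0):
    """First-stage match: two gray values agree to within tolerance eps."""
    return abs(gray_left - gray_right) <= eps


def second_stage_match(plane, x, y, n=3, lowlim=50):
    """Second-stage match at (x, y) in one disparity plane.

    plane[y][x] holds 0/1 first-stage matches. The match succeeds when the
    N x N window centered on (x, y) holds at least N*N*(lowlim/100) ones.
    Cells falling outside the plane are counted as non-matches (assumption).
    """
    half = n // 2
    count = 0
    for yy in range(y - half, y + half + 1):
        for xx in range(x - half, x + half + 1):
            if 0 <= yy < len(plane) and 0 <= xx < len(plane[0]):
                count += plane[yy][xx]
    return count >= n * n * lowlim / 100


# With N=3 and LOWLIM=50 the threshold is 4.5, so 5 of the 9 cells must match.
isolated = [[0, 0, 0], [0, 1, 0], [0, 0, 0]]
surface = [[1, 1, 0], [1, 1, 1], [0, 0, 0]]
```

An isolated first-stage match fails the second stage, while a patch with five of nine cells matched passes it.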

The above description glosses over the step that preceded the
simulation of the matching of the left and right images of the
Mars-like scene pictured in Fig. 16. First, the left image of Fig. 16 was
digitized and stored on magnetic tape. The camera was then moved
50.8 mm to the right and the right image digitized and stored on
magnetic tape. A human operator then determined the vertical disparity
(y shift) of the left and right images by matching the digitized video
waveform of one scan line of the left image with the digitized video
waveform of a scan line of the right image with the aid of the program
MSTUDY. The scan lines he compared were of a region of uniform
range, namely, the flat backdrop. The operator eliminated the
vertical disparity by entering its amount, usually only a few lines,
into the program STEREO.

STEREO forms on magnetic tape the match space required by 
the next simulation program. (Magnetic tape is used only in the 
simulation.) The proposed hardware, described in Section III, would 
hold only as many digitized lines of an image as are required by the 
filters that examine them. It would store only N planes of m-space.

The need for the y shift is eliminated in the C3 camera
described in Section V, but not in the D1 or E1 cameras. For the
latter two, either the amount of the vertical disparity will have to be
computed automatically and used to match the two images vertically
or the next stage of computation must be designed to tolerate vertical
disparity. The nervous system of vertebrate animals tolerates some
vertical disparity in computing depth (Ref. 23) and interprets this
disparity in the "induced effect" (Ref. 34).

TABLE 1

VALUES OF THE MATCH VARIABLES

Subject                             ε (in STEREO   N     LOWLIM/100   Program applying
                                    program)                          N and LOWLIM/100

Random-dot stereogram (Fig. 9)      0              3     .50          STROUT
Random-block stereogram (Fig. 14)   0              3     .50          EXPER
Mars-like scene (Fig. 16)           4              13    .50          EXPER

The contents of m-space can be displayed in the form of either
x'-d sections, such as those of Figs. 10, 11, 12, 15 and 17, or
a range map. An x'-d section of m-space can be generated
for any value of y by the program MSTUDY. The range map is
described in II I.

F. Use of Model in Match Space

After a model has been formed in match space at least two
questions need to be asked: (1) Is each match plotted there probably
true or probably spurious? (2) Are there surfaces in the original
scene that are viewed by one camera but not the other and are
therefore not modelled in match space? We call such surfaces "ambiguous".

Subsections II H to II L present the rules we have devised to 
answer the above questions. The rules evolve in three stages and are 
illustrated by scenes of increasing complexity. 

The simplest scene is one generated by computer from dots of
random values of gray, briefly called "random dots". (Our usage
here differs from that of Bela Julesz (Ref. 24). He uses the term
"random dot" to describe a two-value, black-and-white display.)
Such a scene is simplest because the probability of a spurious surface
in match space is negligible.

A scene of higher-order complexity is again computer-generated
but now of random uniform areas of gray. We call such scenes
"random-square" or "random-block" pictures. Here the probability of a
spurious surface of matches is greater, because any spurious event will
automatically be a surface.

A third order of complexity is a real scene. This can be
processed by the rules devised for random-square pictures.
G. Geometry of Binocular and Match Spaces

Before proceeding with rules for interpreting a model in match
space we need to take another look at the geometry of a stereo TV
camera assembly. Figure 6 diagrams the same two cameras
pictured in Fig. 3, and the same binocular and match spaces, except that
the views are now from behind the cameras. The disparity between
the left and right images of the point P is the difference between
the two image displacements,
whether the object is between the optical axes, as in Fig. 6, or at
one side of them, as in Fig. 7.

The example for which range needs to be computed is the
Mars-like scene of Fig. 16. The range, z, of any point, P,
measured from the optical centers of the lenses, is

$$ z = \frac{2bf}{d} \tag{1} $$

where the variables have the meanings given in Fig. 6. (For a
derivation, see Appendix A.1.) There is an uncertainty, $\Delta z$, in this
measurement for which an equation is derived in Appendix A.2.
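
A worked instance of Eq. (1) may help fix the relation between disparity and range. The sketch below rests on several assumptions not stated in this section: that 2b is the separation of the two camera positions (taken here as the 50.8 mm camera displacement reported in II E), that d is converted from pixels to a length on the tube face by a pixel pitch, and that the focal length and pixel pitch have the purely hypothetical values shown.

```python
import math


def range_from_disparity(d_pixels, baseline_m, focal_m, pixel_pitch_m):
    """Range z = 2bf/d, with 2b given as baseline_m (assumed meaning).

    d_pixels is the disparity in pixels; pixel_pitch_m converts it to a
    length in the image plane. Zero disparity corresponds to infinity.
    """
    if d_pixels == 0:
        return math.inf
    return baseline_m * focal_m / (d_pixels * pixel_pitch_m)


BASELINE_M = 0.0508      # 50.8 mm camera displacement, taken as 2b (assumption)
FOCAL_M = 0.025          # hypothetical focal length
PIXEL_PITCH_M = 2.5e-5   # hypothetical pixel spacing on the tube face

ranges = [range_from_disparity(d, BASELINE_M, FOCAL_M, PIXEL_PITCH_M)
          for d in range(36)]
```

Under these assumed values the 36 disparities of match space run from infinity at d=0 down to the nearest range at d=35, and the steps between successive ranges grow rapidly as d approaches 0, which is why range uncertainty increases with z.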

In the system whose operation we describe here, the
number of pixels in each line scanned by the TV camera for both left
and right views is 512. That is, the counter that determines the
position of the electron beam along the horizontal axis of each TV
image counts to 512. The counter that determines the vertical
position also counts to 512. However, only the central 256 columns
and 256 scan lines were used in acquiring the images of Fig. 16.
Each computer-generated image used as an example in the next
three subsections also measures 256 x 256 pixels. Only 128
columns, approximately at the center of these images, were
compared and the matches plotted in the x'-d sections shown in Figs.
10, 11, 12, 15 and 17.
Fig. 6. (a) Geometry of Fig. 3, when object viewed is
between parallel optical axes.
(b) Match space corresponding to the binocular
space in (a).

Fig. 7. Geometry of Fig. 3 when the object viewed is at
the left of both optic axes.
H. Rules for Processing Random-Dot Stereograms

Random-dot stereograms were generated in our experiments to
investigate the detection of spurious matches and the elimination of
ambiguous regions. Figure 8 is a plan view of objects, O, which
hang in space before backgrounds, B. Figure 9 is a stereogram* of
a scene like that in Fig. 8 formed from identical random-dot patterns.
A 64 x 64 pixel region in the left background was shifted 5 pixels to
the right so that it appears nearer than the background. The region
uncovered by the shift was filled with more random dots.

The simulation program STEREO compares the images of
Fig. 9, makes first-stage matches and maps them in center-view
m-space. STEREO produces one plane of first-stage matches at d=0
corresponding to the background, a second plane of matches at d=5
corresponding to the square floating in space, and some spurious
matches. Figure 10 shows an x'-d section of this m-space. (The
checkered area of m-space in Fig. 3 is a similar section.)

The simulation program STROUT examines each "region", which
is a volume extending from front to back of m-space, N pixels high
and N pixels wide, seeking a surface of at least $N^2$ (LOWLIM/100)
matches in a plane perpendicular to the optical axes of the cameras.
A more complex simulation program (and a more useful one) would
search for arbitrarily oriented surfaces in m-space. When, in
examining the stereogram of Fig. 9, STROUT finds a plane that
meets the conditions N=3, LOWLIM=50, it maps the point in a "range
map". By this we mean a one-eyed view of the scene in which each
point represents, by a gray value, the disparity (or range) of the
nearest surface in the scene at that pixel.

* A good quality viewer for the stereograms in this report is the
Model PS-2, made by Air Photo Supply Corp., 158 South Station,
Yonkers, New York, 10705. Use 63mm separation of lenses in
this viewer unless your eyes are closer together or further apart.
Most copies of this report contain a viewer made of cardboard
and plastic lenses. This viewer can be purchased only in a large
quantity.

Fig. 9. A random-dot stereogram depicting a
square floating before a background.

Because it requires that a percentage of the points in an N x N
plane be first-stage matches, STROUT rejects isolated first-stage
matches as spurious. When the image of a plane perpendicular to
the z axis (Figs. 8 and 9) is modelled in m-space (Figs. 10 and 11),
all the matches lie in one x'-y' plane. Thus a spurious match in a
given x'-y' plane will not have many neighboring matches in that
plane, while a second-stage match will adjoin other matches.
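
The way a range map follows from m-space can be sketched as below. This is a simplified illustration and not the STROUT program itself: it slides an N x N window over every pixel rather than stepping through separate regions, it assumes m-space is indexed as `mspace[d][y][x]` with larger d nearer the camera, and the names `range_map`, `window_count` and `AMBIGUOUS` are hypothetical.

```python
AMBIGUOUS = -1  # marker for pixels with no second-stage match (assumed name)


def window_count(plane, x, y, n):
    """Count first-stage matches in the N x N window centered on (x, y)."""
    half = n // 2
    rows, cols = len(plane), len(plane[0])
    return sum(plane[yy][xx]
               for yy in range(max(0, y - half), min(rows, y + half + 1))
               for xx in range(max(0, x - half), min(cols, x + half + 1)))


def range_map(mspace, n=3, lowlim=50):
    """Map each pixel to the disparity of the nearest surface found there."""
    depth, rows, cols = len(mspace), len(mspace[0]), len(mspace[0][0])
    need = n * n * lowlim / 100
    out = [[AMBIGUOUS] * cols for _ in range(rows)]
    for y in range(rows):
        for x in range(cols):
            for d in range(depth - 1, -1, -1):  # nearest surface first
                if window_count(mspace[d], x, y, n) >= need:
                    out[y][x] = d
                    break
    return out


# Hypothetical 7 x 7 m-space, three disparities deep: a background at d=0,
# a 3 x 3 object at d=2 that hides the background behind it, and one
# isolated spurious match at d=1.
SIZE = 7
obj = range(2, 5)
mspace = [[[0] * SIZE for _ in range(SIZE)] for _ in range(3)]
for yy in range(SIZE):
    for xx in range(SIZE):
        inside = yy in obj and xx in obj
        mspace[2][yy][xx] = 1 if inside else 0
        mspace[0][yy][xx] = 0 if inside else 1
mspace[1][1][1] = 1
rmap = range_map(mspace)
```

In this example the center of the object is reported at d=2, and the isolated match at d=1 is rejected because it has no neighbors in its plane, so the pixel there is reported at the background disparity.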

What should be done in a region of m-space where there is no
second-stage match is simple for the examples given in this report,
but can be complicated for other examples. Let us consider first
the simple examples so far presented. STROUT labels as ambiguous
regions of m-space where no match is found (Figs. 8, 10 and 11). A
subroutine RESOLV then follows the observation of Julesz (Ref. 25)
about simple stereograms such as that in Fig. 9: "Regions of
ambiguity are always perceived as being the continuation of the adjacent
area that seems farthest away". RESOLV searches m-space one
plane at a time for regions marked as ambiguous. When it encounters
one, it examines the first non-ambiguous region to the left and right
and chooses the one which represents a surface farthest from the
camera to replace the marked ambiguity. The regions $a_L$ and $a_R$
(Figs. 8 and 10) will then have been filled in and the ambiguity
removed.
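
The fill-in rule can be illustrated on a single row of disparities. The sketch is a simplified stand-in for RESOLV, not its actual code: it works on one row of a range map rather than on planes of m-space, it assumes that a smaller disparity means a surface farther from the camera, and the names `resolve_row` and `AMBIGUOUS` are hypothetical.

```python
AMBIGUOUS = -1  # marker for an ambiguous region (assumed name)


def resolve_row(row):
    """Replace each ambiguous entry by the farther of its two neighbors.

    The neighbors are the first non-ambiguous entries to the left and to the
    right. "Farther" is taken to mean the smaller disparity (assumption).
    Entries with no non-ambiguous neighbor are left unchanged.
    """
    out = list(row)
    for i, value in enumerate(row):
        if value != AMBIGUOUS:
            continue
        left = next((row[j] for j in range(i - 1, -1, -1)
                     if row[j] != AMBIGUOUS), None)
        right = next((row[j] for j in range(i + 1, len(row))
                      if row[j] != AMBIGUOUS), None)
        candidates = [c for c in (left, right) if c is not None]
        if candidates:
            out[i] = min(candidates)
    return out


# Background at d=0, object at d=5, ambiguous regions on both sides of it.
filled = resolve_row([0, 0, AMBIGUOUS, AMBIGUOUS, 5, 5, 5, AMBIGUOUS, 0])
```

Both ambiguous regions in the example are filled with the background disparity, which matches the observation quoted from Julesz.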

This method of treating ambiguous regions is appropriate
when the object viewed is a plane, as in Fig. 9, but suppose the
object is solid and the left side is viewed only by the left camera
and the right side only by the right? For such a scene, another
technique is needed; namely, one in which the memory of textures
and patterns, viewed before, is applied to interpreting what can be
viewed by only one camera. This is cognitive processing.

* This process is analogous to what Julesz calls "global stereopsis".
To quote him: "With increased dot density the visual system cannot
find uniquely the corresponding points, and a new process has to be
invoked which can resolve ambiguities by global considerations."
(Ref. 22)

Fig. 10. An x'-d section of the m-space generated
from the stereogram in Fig. 9.

Fig. 11. x'-d sections of m-space showing, in (a),
ambiguous regions $a_L$ and $a_R$ on the left and
right sides, respectively, of the model of the
object and, in (b), how both the image of the
object and the background behind it are
modelled when $w < (d_{obj} - d_b)$.

I. How Should Ranges in a Scene Be Presented?

In building a sequence of computations into a process, such as 
the detection of range and shape, means are needed of viewing the 
steps in the process. Such a means is a projection onto an x'-y' 
plane of all values of disparity in m-space. Since an array of
numbers is difficult for a human being to interpret, we have written 
a program PICT which displays these numbers as levels of gray on 
a cathode ray tube where they can be photographed. Figure 18 is 
such a photograph. While the levels of gray correspond to measure- 
ments of disparity, the impression given the viewer is that of range. 
Hence we call this display a "range map". 
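A display of this kind can be sketched as a linear mapping from disparity to gray level. The sketch below is not PICT itself. It assumes a 32-level display, that larger disparity means a nearer surface, and that nearer surfaces are shown lighter; the function name `range_map` is hypothetical.

```python
def range_map(disparity, levels=32):
    """Map a 2-D array of integer disparities onto gray levels 0..levels-1.

    Assumption: larger disparity = nearer surface = lighter gray.
    """
    lo = min(min(r) for r in disparity)
    hi = max(max(r) for r in disparity)
    span = hi - lo
    if span == 0:
        return [[0 for _ in r] for r in disparity]
    top = levels - 1
    # Integer division keeps every output an exact gray level.
    return [[(v - lo) * top // span for v in r] for r in disparity]


# Background at disparity 2, a near surface at 6, an intermediate one at 4.
gray = range_map([[2, 2, 2, 2],
                  [2, 6, 6, 2],
                  [2, 4, 4, 2]])
```

Each (x', y') position carries a single gray value, which is why such a map can report only one surface per position.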

After using range maps for several years we find that they 
suffer a defect. When, as in Fig. 8b, the object O is small and 
near enough to the cameras that they view both the object and its 
background, the range map is unable to report both. In Fig. 8b the
region between the two ambiguous regions, a_L and a_R, is visible to
both cameras, but the range map cannot show it.

A stereogram range map could obviate this difficulty. How- 
ever, if EXPER did not eliminate the more distant of two surfaces 
in the same region, it might not eliminate spurious area matches 
either. Therefore, for the present, we accept the limitation that a 
surface behind a front surface cannot be shown in a range map. 

J. Eliminating Areas of Spurious Matches

Because areas of spurious dot matches are not likely to be
formed from a random-dot stereogram, STROUT is not equipped to
reject such areas. Such areas are likely to be formed both from
random-square stereograms (Fig. 14) and from stereograms of real
scenes (Fig. 16). The latter tend to contain areas of the same
value of gray (within tolerance) because their dots are not random.

To eliminate areas of spurious dot matches we devised two
simulation programs, EXPER and FUSER. Let us consider with
the aid of Fig. 12 how EXPER might be used alone for this task. It
will be shown in Fig. 15 that a model in m-space of an area of
uniform gray in a scene appears as a parallelogram in an x'-d
section of m-space. Because parallelograms are awkward to
illustrate at this stage of our discussion, we will continue to use a
line to represent a plane in Figs. 12 and 13.

Assuming again that the background of the scene pictured in a
stereogram will be continuous, we can see how right-view and
left-view m-spaces can be used to reveal whether or not a surface is
spurious. Consider again the scene mapped in Fig. 8b. Three
forms of the simulation program STEREO form, from a stereogram
of this scene, left-view, center-view and right-view models in
m-spaces. (The program MSTUDY generates the x'-d sections of
these m-spaces shown in Figs. 12a, 12b and 12c.) Because, in
Fig. 12a, the model of the object hides the right ambiguous region,
and, in Fig. 12c, the model of the object hides the left ambiguous
region, the model of the object is considered true, not spurious.

To make this check automatically, the data of the left-camera and
right-camera views need to be converted to center-view m-spaces
and these m-spaces compared. That will be done by the program
FUSER.

EXPER performs more operations than we have so far 
described, as Fig. 13 shows. Figure 13 begins at (a) with the plan 
view of a simple scene: an object O in front of a background B. 
Fig. 12. x'-d sections of three match spaces formed
from the same stereogram: (a) left-camera view,
(b) center view, (c) right-camera view. The ambiguous
regions are labelled a_L and a_R.
Fig. 13. Diagram showing the functions of the three
simulation programs, STEREO, EXPER
and FUSER.
Step (b) is performed by MSTUDY to show an x'-d section of
center-view m-space and two spurious matches, S1 and S2, which happen
to hide the two ambiguous regions a_L and a_R. STEREO forms
left-view and right-view m-spaces, of which MSTUDY forms the x'-d
sections shown at (c).

Where there are two or more surfaces in a region of either
left-view or right-view m-space, EXPER eliminates all except the
largest in the following manner. At each set of values of x' and y',
EXPER begins by searching an N x N region, in this case 3 x 3, for
possible surfaces. If two or more are encountered, EXPER enlarges
its area of search until it finds the largest. It then retains only the
point on this largest surface. For example, in the left view of
m-space (Fig. 13c), because S1 is smaller than O, EXPER eliminates
S1; and, because S2 is smaller than O, EXPER preserves
only those points in S2 which do not hide points in O. In the right
view of m-space, because S1 is smaller than B, EXPER preserves
only those points of S1 not hidden by B; and, because S2 is smaller
than B, EXPER eliminates S2.
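One way to picture the growing-window search is the sketch below. It is not the EXPER program. It assumes m-space is a set of (x', y', d) triples, measures the size of a surface by the number of matches inside an n x n window in a single disparity plane, and enlarges the window only while candidates tie; the names `support` and `keep_largest` and the limit `max_n` are hypothetical.

```python
def support(matches, x, y, d, n):
    """Number of matches in the n x n window centred on (x, y) in plane d."""
    h = n // 2
    return sum(
        (x + dx, y + dy, d) in matches
        for dx in range(-h, h + 1)
        for dy in range(-h, h + 1)
    )


def keep_largest(matches, start_n=3, max_n=9):
    """At each (x, y), retain the disparity whose surface has the most support."""
    by_xy = {}
    for x, y, d in matches:
        by_xy.setdefault((x, y), []).append(d)
    kept = set()
    for (x, y), candidates in by_xy.items():
        best = candidates
        n = start_n
        while len(best) > 1 and n <= max_n:
            scores = {d: support(matches, x, y, d, n) for d in best}
            top = max(scores.values())
            best = [d for d in best if scores[d] == top]
            n += 2  # enlarge the search area while candidates still tie
        kept.update((x, y, d) for d in best)
    return kept


# A 7 x 7 true surface at d = 2 and a 3 x 3 spurious patch in front at d = 5.
large = {(x, y, 2) for x in range(7) for y in range(7)}
small = {(x, y, 5) for x in range(2, 5) for y in range(2, 5)}
kept = keep_largest(large | small)
```

At the centre of the patch the two surfaces tie in a 3 x 3 window, so the window grows to 5 x 5, where the larger surface wins. Candidates that remain tied at `max_n` are all kept in this sketch.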

Employing the subroutine RESOLV, described in II H, EXPER,
in the right-view m-space, extends S1 leftward into the ambiguous
region between O and S1. The results of these operations by EXPER
are shown in Fig. 13d. Note that S2 survives in the left-view m-space
and S1 has grown in the right-view m-space. Thus EXPER alone
cannot eliminate all areas of spurious matches.

A simulation program that compares the EXPER-processed
left- and right-view m-spaces to complete the elimination of areas
of spurious matches is FUSER. From each m-space illustrated in
Fig. 13d, FUSER forms the two center-view m-spaces shown at (e).
FUSER then compares these two m-spaces, preserving only those
surfaces common to them both. Finally, FUSER employs RESOLV
to fill in remaining ambiguous regions.
The assumption is that each surface in the scene is opaque,
and therefore blocks out, behind it, one region from the left camera
and another region from the right camera. Forming left-view and
right-view m-spaces is an attempt to check for this condition.
Comparison of what is mapped in the left-view m-space with what is
mapped in the right eliminates surfaces that do not satisfy this
constraint.

K. Random-Square Stereograms

Figure 14 is a random-square stereogram of a large square
floating in front of a background. It is formed of random-gray-value
squares measuring 4 pixels on a side. Figure 15 is an x'-d plane
of center-view m-space formed by MSTUDY from this stereogram.

The match of left and right views of each random square of
Fig. 14 is a diamond because each square in binocular space is
perpendicular to the z'-axis. If the square is skewed with respect
to this axis, the match is a parallelogram. The shape, whether a
diamond or parallelogram, is formed as the left view marches past
the right in the scheme of Fig. 3. At the first overlap of left and
right images of a uniform area of gray, the point of the
parallelogram is formed. After a shift of the right image, the next line of
the parallelogram is formed, two pixels wide, at a larger value of
disparity. When the two small squares overlay each other, the
widest part of the parallelogram of matches is formed.
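The way a uniform square produces a diamond of matches can be reproduced with a small sketch. It is an illustration only, not the comparator of Fig. 3. It assumes one scan line per view, a gray-value tolerance `tol`, and simplified left-referenced indexing in which the row of matches grows by one pixel per shift; the function name `match_widths` is hypothetical.

```python
def match_widths(left, right, shifts, tol=0):
    """For each shift d, count positions x where left[x] matches right[x - d]."""
    widths = {}
    for d in shifts:
        n = 0
        for x, g in enumerate(left):
            xr = x - d
            if 0 <= xr < len(right) and abs(g - right[xr]) <= tol:
                n += 1
        widths[d] = n
    return widths


# One 4-pixel square of gray value 20 seen at x = 6..9 in the left view and
# x = 4..7 in the right view. The two backgrounds are given different values
# (0 and 10) purely so that only the square can produce matches.
left = [0] * 6 + [20] * 4 + [0] * 6
right = [10] * 4 + [20] * 4 + [10] * 8
widths = match_widths(left, right, range(-1, 6))
```

The widths rise from 1 to 4 and fall back to 1 as the shift passes through the true disparity of 2, which is the diamond outline; retaining only the widest row recovers the true disparity.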

In its search for the largest number of matches along each y' -d 
line EXPER finds the widest part of each diamond and retains only 
that. Thus the diamonds of Fig. 15 are reduced to the lines of Fig. 13. 

L. Processing of a Real Scene 

Figure 16 is a stereogram of the scene of Fig. 1 recorded when
the TV camera was approximately 2m from the rock and the lighting
was from upper right. Other test conditions are given in Tables 1
and 2.

Fig. 14. Random-block stereogram of a square floating
before a background. Each block is 4 x 4 pixels
of uniform gray value.

Fig. 15. x'-d section of model in m-space generated
from the stereogram in Fig. 14.

One camera was used to obtain the left view, then moved
50.8mm (2 in.) to the right to obtain the right view. The axes of
the camera in its two positions were parallel. Since the axes of the
cameras were not changed, this was the fixation viewing of the
second strategy of II B. Since a lens of 25.4mm focal length was
used, the resolution was that of the first strategy. Such a
compromise is necessary when only one focal length is employed.

Figure 17 is an x'-d section through the left-justified m-space
formed by STEREO from 128 columns of a stereogram similar to
Fig. 16. The section is at the y-value of the stereogram indicated 
by the two black lines in the margins of Fig. 16. One of the diamonds 
in Fig. 17 models the stick. Another may model the shadow of the 
stick. The right side of Fig. 17 contains at the front spurious 
matches and further back true matches of features in the painted 
backdrop. 

Figure 18 is a range map formed by EXPER from the m-space
of which Fig. 17 is a section. Lightness of gray indicates nearness,
darkness of gray, distance. Thus the rock stands out clearly. Two
shades of gray at the top of the rock, one shade hooking to the right,
indicate how long the rock is. The stick and its shadow get
progressively darker as they recede, leading to the backdrop which is a
uniform black. Because occluded regions have not yet been found
and filled in, two of them appear as black areas below the rock and
below the stick. A picture was made of the output of FUSER after
it had filled in these occlusions, but it was poorly displayed so we
do not include it.
Fig. 16. Stereo images of the Mars-like scene of Fig. 1,
acquired by television camera, digitized, then
displayed one at a time on an oscilloscope.

Fig. 17. x'-d section of the right camera view of match space
formed from the stereogram of Fig. 16. The y value
of this section is indicated by lines in the margins of
Fig. 16.
TABLE 2

TEST CONDITIONS FOR EXAMPLE OF II L
(In addition to those given in the bottom line of Table 1)

Interocular distance, 2b                = 50.8mm (2 in.)
Focal length, f                         = 25.4mm (1 in.)
Distance from camera to nearest object  = 2m (approx.)
Camera tube type                        = GEC TD8484
TV camera and control                   = Colorado Video, Inc., Type 501
Computer                                = Digital Equipment Corp. PDP-9
                                          with 8K words of core memory
                                          and 2 DEC tape drives
Display oscilloscope                    = Tektronix 530
Display-tube type phosphor              = P4
Fig. 18. Range map of one match space of the stereogram
in Fig. 16. The black areas below the rock are
ambiguous because they were viewed by the
camera in one position and not in the other.
M. Determining Form

McCulloch observed that texture and form are not primarily 
visual phenomena. They are "the way a surface or object would 
feel if you could feel it" (Ref. 26). To determine form this way 
range data should be fed both to a touch system and to the controls 
of an arm and hand that is capable of reaching into the space 
viewed by the stereo TV camera. Since judgement of form from 
camera input data will be only a guess, the arm and hand can con- 
firm or deny this guess. Design of a system to operate this way 
is considered in the final paragraph of V. 

N. Range Accuracy 

How accurate is the above system, assuming that the ambigu- 
ities just described have been removed and the spurious matches 
eliminated? How can range accuracy be increased? 

The stick in Figs. 1, 16 and 18 is 1.27m (50 in.) long. In the
original Polaroid print of Fig. 18, there are seven levels of gray
along the length of the stick. Dividing seven into 1.27 indicates that
intervals of range have been detected of about 18cm (7.1 in.). This
is approximately the uncertainty that is predicted when measured
characteristics of the camera are substituted into Eq. (2) below.

The rock, the length of which is 20cm from front to back, is shown 
as two levels of gray. 

Appendix A derives the following formula for range uncertainty:

$$\Delta z = \frac{\Delta s \cdot z^2}{bf - \Delta s \cdot z} \qquad (2)$$

Because Δs is very small with respect to z, the second term of the
denominator may be ignored. Eq. (2) then becomes

$$\Delta z = \frac{\Delta s \cdot z^2}{bf} \qquad (3)$$


Assuming that the camera tube is selected for the smallest
possible uncertainty, Δs, in the position of a point, it can be seen
that range accuracy can be increased by increasing the focal
length, f, of the lenses, by increasing the separation of the optic axes,
2b, or by bringing the camera nearer to the objects to be examined
(smaller z). If longer focal length lenses are used to examine
details, short focal length lenses are still needed as finders of the
details to be examined (strategy 1). Thus, a stereo TV camera is
needed with lenses of two focal lengths. If wider separation is used
between the optic axes, the axes need to be converged on the objects
of Fig. 16, requiring trigonometric functions in Eq. (2). Such
functions can be employed but were avoided in this first pass through
the problem. Obviously the camera could have been brought nearer,
but then it could not have viewed as many objects as in Fig. 16.

Increasing by a factor of ten the focal length of the camera lens
employed in the test described in II L will reduce range uncertainty
to 1.8cm (0.7 in.). Increasing the interocular distance by a factor
of 4, to widths described in V, will further reduce the range
uncertainty to 0.45cm (0.18 in.). However, these changes can bring
other problems. Lenses of focal length this long cannot be
accommodated in the Type C3b camera. While such lenses can be
accommodated in the Types D1 and E1, there is a problem of vertical
misregistration (vertical disparity) in these types, as explained in
II E, which special computation is required to remove. Solutions to
these problems are being devised.
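The scaling quoted above follows directly from Eq. (3), as the short sketch below shows. The values of b and f come from Table 2 (b is half the 50.8mm interocular distance). The range z = 2m and the position uncertainty Δs = 0.029mm are assumptions, chosen here only so that the result reproduces the observed 18cm; neither is a measured value from the report.

```python
def range_uncertainty(ds, z, b, f):
    """Eq. (3): approximate range uncertainty; all lengths in the same unit."""
    return ds * z ** 2 / (b * f)


def range_uncertainty_exact(ds, z, b, f):
    """Eq. (2): the same, keeping the second term of the denominator."""
    return ds * z ** 2 / (b * f - ds * z)


B = 25.4     # mm, half of the 50.8 mm interocular distance 2b
F = 25.4     # mm, focal length
Z = 2000.0   # mm, assumed range
DS = 0.029   # mm, assumed uncertainty in image position

base = range_uncertainty(DS, Z, B, F)                  # about 180 mm
longer_lens = range_uncertainty(DS, Z, B, 10 * F)      # focal length x 10
wider_base = range_uncertainty(DS, Z, 4 * B, 10 * F)   # and separation x 4
```

Under these assumptions the three results are about 180mm, 18mm and 4.5mm, matching the 18cm, 1.8cm and 0.45cm quoted in the text. With these numbers Eq. (2) gives a value roughly ten per cent larger than Eq. (3), so the approximation is adequate for an estimate of this kind.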
Fig. 19. Test pattern and amplitude response of TV camera.
a) "Line Selector" test pattern (Westinghouse resolution chart
ET-1332, purchased from Tele-Measurements Inc.,
145 Main Avenue, Clifton, N. J.), reproduced 1/3 full size.
b) Plot of digital words formed by the system of Fig. 0 as the
electron beam in the camera sweeps the image of (a).
Axis label: square-wave spatial frequency in TV lines per width
of test pattern (printed values 36, 129, 257, 375, 530, 692).
III. COMPUTATION TO EXTRACT OTHER FEATURES


A1. Square-Wave Frequency Response of Camera

Other features, besides range and form, that may be detected 
in a scene, are the edges and lines shown to be detected by cats and 
monkeys (and probably also by human beings), and the reflecting 
properties of surfaces of interest. 

The computation of an edge by a TV camera-computer system 
needs to be described in the language of TV cameras, computers and 
picture processing. A TV camera makes a "square -wave response", 
which is converted into "digital words" for the computer. Each word 
indicates the "gray value" of a "pixel". An edge is a difference in 
gray values (Ref. 27). 

To determine the square -wave frequency response of the camera 
we replaced the stereo optics of Fig. 0 by a single lens and aimed the 
camera at the transparency of Fig. 19a which we lighted from behind. 

We positioned the camera so that the transparency and its margin 
were just included within the scanned area of the camera tube. 

Figure 19a provides spatial square waves of 11 different
frequencies, seven of which are labelled. A square-wave spatial
frequency, to a TV camera, is the number of lines, both "black" and
"white", that can be imaged in the scanned area of the camera tube.
We measure this frequency in TV lines per width of test pattern.
Because there are both a "black" and a "white" TV line in each cycle,
the number of cycles per width of test pattern is one half this number.

As the electron beam in the camera tube scans along one scan line,
the camera generates a voltage which an analog-to-digital converter
changes to digital words. The voltage from one scan line is converted
to 512 digital words, one for each of 512 positions along the scan line.
Figure 19b plots digital words formed as the beam sweeps through 120
of these positions. (Of the 512 positions in the camera's image of
Fig. 19a, the central 256 are presently acquired by the computer.
Of these, positions 60 to 180 are plotted in Fig. 19b.)

Each step in Fig. 19b is a pixel whose gray value is indicated 
by a digital word on a scale from 0 to 31. Figure 19c plots the 
average change in amplitude between the black and white halves of 
each square wave. The plot is made relative to the 36 TV lines/ 
frame frequency. From this plot can be read numbers that charac- 
terize the system, namely, the square wave response, in TV lines, 
at 10%, 50% and 100% modulation: 470, 230 and 36. 

A2. Computation of Edges

Figure 20 pictures the algorithm we designed to detect "coarse"
edges. It is composed of six arrays of 0's, 1's and -1's which we call
"filters". We assign a special meaning to *, namely, that each
number in the filter will be multiplied by the gray value in a submatrix
of the image, that is the same size as the filter, and the products
summed. (Actually, all filters are made the same size by filling
them out with zeros to measure 7x7 digits.) To each 7x7 submatrix
of the image all six filters are applied.

Fig. 19 (Cont'd). Test pattern and amplitude response of TV camera.
c) Amplitude response. O's are a plot of the relative amplitude
of response in (b) to the image of (a). X's mark the square-wave
spatial frequencies detected by the top three filters of Fig. 20.

Fig. 20. Operations performed on each 7x7 pixel array
of a digitized image to detect edges. Labels in the figure:
Number of TV lines that can be detected in image of Fig. 19a:
K (A + B) * Output for each position in the image
K * Constant

Since our original goal was to form a line drawing from spatial
frequency information and our thinking was influenced by the methods
of Fourier analysis, we attempted to add each of the terms of Fig. 20.
The results were of little value. Then we were advised by Dr. Azriel
Rosenfeld to multiply the terms as he did in Ref. 28. The results are
shown first in Fig. 21 and, after thinning by searching for local maxima,
in Fig. 22. Multiplication here, as he said, is "counterintuitive";
it serves to detect an edge.
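Why multiplication localizes an edge can be seen in a one-dimensional sketch. The filter coefficients below are hypothetical stand-ins, since the actual arrays of Fig. 20 are not reproduced in the text: three difference filters spanning three, five and seven pixels, each padded with zeros to a length of seven and each with a 0 at its centre. The function names are also assumptions.

```python
# Hypothetical 1-D stand-ins for the top row of filters in Fig. 20.
FILTERS = [
    [0, 0, -1, 0, 1, 0, 0],     # spans three pixels
    [0, -1, -1, 0, 1, 1, 0],    # spans five pixels
    [-1, -1, -1, 0, 1, 1, 1],   # spans seven pixels
]


def filter_sum(line, x, filt):
    """Multiply each filter number by the gray value under it and sum."""
    return sum(c * line[x - 3 + k] for k, c in enumerate(filt))


def edge_strength(line):
    """Absolute value of the product of the three filter sums at each pixel."""
    out = [0] * len(line)
    for x in range(3, len(line) - 3):
        product = 1
        for filt in FILTERS:
            product *= filter_sum(line, x, filt)
        out[x] = abs(product)
    return out


# A step from gray value 5 to gray value 25 between positions 7 and 8.
scan = [5] * 8 + [25] * 8
strength = edge_strength(scan)
```

The product is non-zero only at positions 7 and 8, the two pixels beside the step, because the narrowest filter gives zero everywhere else. Adding the three sums instead would give a non-zero response from position 5 to position 10, a much broader mark.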

The filter at the upper left of Fig. 20 detects two TV lines (one
dark, one light) in three pixels. Across the full width of the pattern
of Fig. 19a,

2/3 x 512 = 308 TV lines

can be detected by this filter, as indicated at the top of Fig. 20. The
upper center filter detects two TV lines in five pixels, the upper
right filter two TV lines in seven pixels. Along the lower row of
Fig. 20 are detectors of the same square-wave frequencies turned
90°. At the top of Fig. 20 are the square-wave spatial frequencies
detected by the filters below them. Plotting these frequencies as X's
on the graph of Fig. 19c shows the relative response of the camera
tube to the frequencies detected.

* To get the effect of three dimensions, look first at the rock in the
foreground, then at the crater in the distance, then at the rock, then
at the crater, and so on. Occasionally alter the route by following
the stick or exploring the hills.

When all six filters are overlaid, the 0 at their center is the
address of the edge that they detect.

B. Formation of a Line Drawing

Plotting all of the edges detected by the algorithm of Fig. 20,
in the images of Fig. 16, results in images which are printed in
negative form in Fig. 21. That is, sharpness of edge results in brightness
of display which is represented as blackness in Fig. 21. While the
span of gray values that can be displayed is only 0 to 31, the result of
the operations of Fig. 20 is often greater than that. For the display,
each result is truncated at 31 so that, for every edge in Fig. 16, there
are usually several lines of dots in Fig. 21. We call this a "coarse
line drawing."

A "drawing" was thought desirable, when this work began, as a
means of transmitting to earth the appearance of a Mars scene with
minimum power. The power saving results from the use of a binary
code in a raster that is always the same size. The position of a bit
in the raster is thus given by the time of its arrival after an initial
synchronizing pulse. Figure 22 requires about 1/30 as much power
to transmit as Fig. 16.

Figure 23 shows how a coarse line drawing is thinned.
Fig. 23a plots the average of the gray values along seven adjacent
scan lines in a digitized TV image. There is one gradual transition
from light to dark at x1, one abrupt transition at x2. Both are edges.
Figure 23b shows the result of performing the computation of Fig. 20
on the scan line of Fig. 23a. The thinning routine detects local
maxima in gray value (A and B in Fig. 23b), locates x1 and x2 and
displays them on the oscilloscope as the lightest gray (Fig. 23c).

c) Result of detecting local maxima at x1 and x2.

Fig. 23. Detection of edge and thinning of edge.

We prefer to present the negative of such a display. Figure 22 is a
negative of the result of applying this edge-thinning operation to the
data of Fig. 21.
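Thinning by local maxima can be sketched along one scan line as follows. This is an illustration, not the routine used for Fig. 22. It assumes edge strengths already truncated to the display range 0 to 31, marks each maximum with the value 31 (taken here to be the lightest gray), and on a flat-topped peak marks only the first sample; the function name `thin` is hypothetical.

```python
def thin(strength, on=31, off=0):
    """Mark local maxima of edge strength along one scan line."""
    out = [off] * len(strength)
    for x in range(1, len(strength) - 1):
        v = strength[x]
        # Strictly greater than the left neighbor, so that a flat-topped
        # (truncated) peak is marked once, at its first sample.
        if v > 0 and v > strength[x - 1] and v >= strength[x + 1]:
            out[x] = on
    return out


# A gradual edge peaking at position 3 and an abrupt edge truncated at 31.
coarse = [0, 2, 9, 14, 9, 2, 0, 0, 31, 31, 31, 0]
thinned = thin(coarse)
edge_positions = [i for i, v in enumerate(thinned) if v == 31]
```

Each edge, whether gradual or abrupt, is reduced to a single marked position, which is what turns the several lines of dots of the coarse drawing into one.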

C. Hardware to Detect Edges 

Detection of edges in left and right images can be performed 
by the assembly shown in Fig. 24. Between the analog -to -digital 
converters and the comparator pictured in Fig. 3, are two banks 
of shift registers and computing elements behind each bank. As the 
electron beam, say, of the left camera tube, detects the signal at 
the first pixel of a scan line, that signal is converted to a five bit 
word and fed into the top level of the lower bank of shift registers. 

As the electron beam advances to the second pixel, the first digital 
word advances one position along the top level of the bank of shift 
registers and another word takes its place. 

When the electron beam reaches the end of the first scan line, 
it snaps back to start a second line and the first word that entered 
the top level of the bank of shift registers is shifted through connec- 
tions not shown to become the first word in the second level. At the 
same time the first signal from the second scan line is digitized into 
a five -bit word and fed into the top level. (Words are shown six bits 
long because we aim to digitize to this number. ) The process just 
described continues until, when seven levels are full, computation 
occurs on each 7 pixel x 7 pixel array that passes before the detector 
of edge (Fig. 24 top center). From then on, after each shift, another 
7x7 array is processed until the entire digitized image has been 
processed this way. 
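The behavior of the bank of shift registers can be imitated in software, as a rough sketch. The code below is not a description of the hardware of Fig. 24. It assumes one long chain of words fed in raster order, with the 7 x 7 array read from fixed taps after each shift; the function name `windows` and the use of a Python `deque` are choices made for this illustration.

```python
from collections import deque


def windows(pixels, width, size=7):
    """Yield (top_row, left_col, window) after each shift that fills a window.

    pixels arrive in raster order; width is the number of pixels per scan line.
    """
    depth = (size - 1) * width + size   # words that must be held at once
    chain = deque(maxlen=depth)         # the oldest word drops off the end
    for n, p in enumerate(pixels):
        chain.append(p)
        row, col = divmod(n, width)
        if row >= size - 1 and col >= size - 1:
            # chain[-1] is pixel (row, col); the pixel i lines up and j
            # positions back sits i * width + j places earlier in the chain.
            win = [
                [chain[-1 - (i * width + j)] for j in range(size - 1, -1, -1)]
                for i in range(size - 1, -1, -1)
            ]
            yield row - size + 1, col - size + 1, win


# An 8 x 8 test image whose pixel value encodes its position (row * 8 + col).
wins = list(windows(range(64), width=8))
```

For the 8 x 8 test image, four 7 x 7 arrays are produced, the first with its top-left pixel at (0, 0). Only six scan lines plus seven words are ever held, which is the economy the shift-register arrangement offers.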

Both the gradient of each edge and its polarity (light-to-dark
or dark-to-light) were lost by multiplying convolution sums together
and taking the absolute value of the product. If, instead, each
convolution sum is retained, it can be coded and fed to the comparator
in the background of Fig. 24. Comparison between left and right
views will then be between gradient, polarity and direction of
detected edges. When range is computed, it will be the range of a
fairly specific edge (or a line, a corner, or other feature). In the
opinion of the first author, there should be fewer problems of
spurious matches.

Fig. 24. Insertion of means of detecting edges between
the stereo TV camera assembly and match space.

Color differences can be determined by employing a three 
color camera in place of the monochromatic one pictured in Fig. 24 
and feeding three digitized color signals into the top-level of each 
shift register. Computation will then be to detect color differences 
as well as gray-level differences. 

The advantages of the shift registers pictured in Fig. 24 are
that they are simple and fast. Shift registers of this kind were
designed and are partly constructed (Ref. 29). Performing like cells
in the retina and cortex of vertebrate animals, this single detector
and thinner of edge is time-shared with the entire area of an image.

A further advantage of separating the effect of each spatial 
frequency is that each can then be employed in an automatic focussing 
routine. That is, changes in the low spatial frequency response can 
be used to indicate in which direction focus can be improved, while 
the high spatial frequency response can be used to indicate that focus 
has been achieved. 

D. Computation of Reflecting Properties 

The reflecting properties of a surface can be determined from 
the incident illuminance onto a surface and the luminance of that 
surface. Illuminance, if sunlight, can be measured either by the 
camera -computer with the face of the camera tube protected by a 
neutral density filter, or it can be measured by a sun sensor. 
Luminance can also be measured by the camera-computer. These 
measurements would be performed through other channels than those 
pictured in Fig. 24. 
IV. RECONSTRUCTING THE APPEARANCE OF A SCENE


By detecting not only features on the surfaces of objects, but also
properties of the scene such as the amount and direction of the
illuminance, the appearance of a scene on Mars may be reconstructed on
earth (Ref. 30). Figure 25 diagrams rays of light from a single source
of illuminance, such as sunlight, onto a cylinder. Fig. 26 shows how
a computer, using information on the shape of an object, on the sources
of illuminance (mainly from upper right, but also from upper left) and
on the reflectance, can recreate the appearance of that object. The
reflectance of the cylinder is here assumed to be diffuse and 100 per
cent. The reflectance of the background is assumed to be diffuse and
50 per cent.
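A reconstruction of this kind can be sketched for a perfectly diffuse surface using Lambert's cosine law. This is an assumed model for illustration, not the program that produced Fig. 26. Surface normals and source directions are unit vectors, each source is described by the illuminance it would give on a surface facing it, and the function name `diffuse_luminance` is hypothetical.

```python
import math


def diffuse_luminance(normal, sources, reflectance):
    """Luminance of a diffuse surface element lit by collimated sources.

    normal      : unit vector normal to the surface element
    sources     : list of (unit direction toward source, illuminance) pairs
    reflectance : diffuse reflectance, 0.0 to 1.0
    """
    total = 0.0
    for direction, illuminance in sources:
        cos_i = sum(n * s for n, s in zip(normal, direction))
        total += illuminance * max(0.0, cos_i)  # no light from behind
    return reflectance * total / math.pi


sun = [((0.0, 0.0, 1.0), 100.0)]  # one collimated source, arbitrary units

facing = diffuse_luminance((0.0, 0.0, 1.0), sun, 1.0)      # cylinder, 100%
edge_on = diffuse_luminance((1.0, 0.0, 0.0), sun, 1.0)     # grazing element
background = diffuse_luminance((0.0, 0.0, 1.0), sun, 0.5)  # background, 50%
```

Evaluating this for the normal at each point around the cylinder, and summing over the two sources mentioned above, would give the kind of shading shown in Fig. 26. The same relation, read in reverse, gives the diffuse reflectance from measured illuminance and luminance.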

Detection of the reflecting properties of a surface, like the 
recognition of objects, requires more than the passive examination of 
a scene. There needs to be an active effort to relate incident illumin- 
ance to reflected luminance. This is particularly true in the detection 
of specular reflectance where highlights have to be found and measured. 
Thus, determination of reflectance needs to be directed by a higher 
authority than the passive visual computers described in this paper. 

Because it will enhance a sense of presence, information on the 
appearance of the scene before a Mars rover should be presented to 
its earth operators stereoscopically. A small room can be built with 
walls that are stereo displays, refreshed by disc or drum memories. 

After the robot has criss-crossed an area several times, it will 
have sent more information to earth than can be displayed at one time. 

A computer can be used in the manner shown in Fig. 26 to picture what 
is known about the scene which the robot will encounter if it makes a 
new traverse across the area. 
Fig. 25. Effect of a source of collimated light
such as sunlight.



Fig. 26. Reconstruction of the appearance of an object 
from its shape, its reflectance and the sources 
of light. 
V. STEREO TV CAMERAS


In pursuit of the objectives stated in the Introduction, four 
families of stereo TV cameras were designed. After two studies, 
which we called "A" and "B", we designated the cameras as follows: 

C. Single TV camera with two optical paths provided by
mirrors (Figs. 0 and 27).

D. Two TV cameras gimballed separately for vergence, 
together for pitch. Lenses are of one focal length (Fig. 28). 

E. Two TV cameras with the same gimballing as in D plus 
roll and azimuth gimbals for the whole assembly (Fig. 29). 
Light entering each camera is split into two paths, one of 
which passes through a long focal length lens, the other 
through a short focal length lens. Each path of light forms 
a separate image on the face of the camera tube. 

F. Stereo facsimile camera for a crawling vehicle (Ref. 31).

The C1 was a conventional TV camera with commercially available
stereo attachment. With its short focal length lenses and 63mm
(2.5 in.) interocular distance it proved to be of little value for our
work. The C2 was a study. The C3 was built and is now operating.

The first configuration of the C3 was the C3a, shown in Fig. 0,
which employed a three-color wheel between the lens and mirror on
each side of the system to permit detecting color differences as well
as luminance differences in the scene. Each mirror of the C3a was
set and bolted in a fixed position. Experiment showed that the angles
of the mirrors needed to be adjustable. Accordingly, we mounted
each mirror in a bearing and arranged that each mirror be turned
by a micrometer acting against a spring (see Fig. 27). We had to
remove the color wheels to make room for these mechanisms. In
both the Types C3a and C3b, the focal length of the lenses is 50mm
and the interocular distance is 21cm (8.25 in.). Focussing is by a
worm gear that moves the TV camera. The C3b could be used in
tests like that in II L when adjustments of the camera assembly
have been completed and the program STEREO has been modified
to compute range from images received along converged axes. The
axes need to be converged because the angles of acceptance of the
lenses in the Type C3b are smaller than those of the lenses used in
the test of II L and the binocular base wider. The advantage of the
C3b camera assembly over the assemblies about to be described is
that it places both images on the face of one camera tube, thus
making it easier to eliminate vertical disparity.

Fig. 28. Type D1 stereo TV camera assembly.
Scale in front of assembly is 6 inches.

Fig. 29. Type E1 stereo TV camera assembly.

Both the D and E types of stereo TV camera assemblies
include two separately gimballed cameras. The D1 frame has been
built and its pitch and yaw electromechanisms operated under servo
control (Fig. 28). The E1 assembly, pictured in Fig. 29, is a design
on paper of an assembly in which each camera contains optic trains
of both 40mm and 400mm focal lengths. The axes of the bearings
supporting the cameras are 21cm (8.25 in.) apart in the Type D1,
17.78cm (7 in.) apart in the Type E1 (Ref. 32).

Each optic train of the E1 assembly reflects light upward onto
the face of a vertical camera tube, providing the two images shown
in Fig. 30. The optic trains are folded upward to keep the
front-to-back measurement of the camera as small as possible. The two
pitch axes of the assembly will permit it to look both straight down 
and straight up. The azimuth gimbal will permit it to look in any 
direction. The roll gimbal will permit it to keep the camera pitch 
axis horizontal. 

The assembly of Fig. 29 has been estimated to weigh 11.4kg
(25 lbs.) when made of light-weight spacecraft materials, as much
as 34.2kg (75 lbs.) when made of aluminum and steel. In either case
the assembly can be mounted most effectively on a rover directly
above the axle, as shown in Fig. 31. An arm and one-fingered hand
are shown attached to a shoulder of the robot to test the estimate,
made by the visual system, of the size and shape of an object. A
second arm with a hand for picking up small objects was also
designed but is not shown here.

The human eyes, with their narrow-angle high-resolution
central fields and their wide-angle low-resolution peripheral fields,
meet the requirements stated in II N. However, there is no camera
tube that can provide, with a short focal-length lens, the angular
resolution of human central vision. The only way to achieve this
high resolution today is to use a long focal-length lens with a
currently available camera tube. As in the human eye, each long
focal-length lens should be rigidly attached to its finder lens so that
when a feature, found in the finder, is centered, it will be in the
field of the long focal-length lens. Such a co-axial-input design was
achieved in the Type E1 assembly, but no provision was made for
focussing. A redesign that provides for focussing is being proposed.



Fig. 30. Diagram of images on the face of each camera tube in the
Type E1 camera assembly.

Fig. 31. Possible configuration of a Mars rover. Stereo TV camera
views scene along lines of sight (1), while hand and arm (2) feel and
accelerometers (3) detect inclination.
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VI. SUMMARY AND CONCLUSION 


Progress toward automatic recognition of three-dimensional
objects, reported here, can be judged from at least two points of
view. One is in terms of hardware built and programs operated.
A second point of view is of the modelling of the vision of animals
that recognize three-dimensional objects.

In this section we summarize our work from the first point
of view and draw conclusions from it. The same work seen from
the second point of view is summarized in Ref. 16. There it is
suggested that the advantage of binocular or stereo vision in a robot,
as in an animal, is economy in the computation of form. When the
form of a three-dimensional object has been determined, recognition
appears achievable by a serial matching of stored features with those
detected in the environment. Such serial matching has been
demonstrated to be characteristic of human object recognition.

In this report we considered two approaches to the computa- 
tion of range from a binocular, or stereo, input. In the first 
approach, described in Section II, gray values in the left image of 
a scene were matched with gray values in the right image to deter- 
mine the range of the objects imaged. In the second approach 
(Section III), edges were extracted from each image. 

When using the first approach, we say that an N x N area of
pixels in the left image is "matched" to an N x N area of pixels in
the right image when a preselected percentage of the first set of
pixels has gray values that are the same, within a tolerance e, as
the gray values of the second set. If the axes of the cameras
providing the two images are parallel, the disparity of a match between
left and right images is zero for a point at infinity and, in our test
system, 36 for the nearest point. Range is computed from disparity.
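
The matching criterion and the disparity-to-range step just described can be sketched as follows. This is only an illustration under stated assumptions, not the STEREO program itself: the function names, the representation of an area as a list of rows, and the sample numbers are all hypothetical, and the range formula is the parallel-axis relation of Appendix A (interocular distance times focal length, divided by disparity, with all lengths in one unit).

```python
def areas_match(left_area, right_area, tolerance, min_fraction):
    """Return True when at least min_fraction of corresponding pixels agree.

    left_area and right_area are equal-sized N x N lists of gray values
    (a hypothetical representation). Two pixels agree when their gray
    values differ by no more than tolerance.
    """
    pairs = [
        (a, b)
        for left_row, right_row in zip(left_area, right_area)
        for a, b in zip(left_row, right_row)
    ]
    agreeing = sum(1 for a, b in pairs if abs(a - b) <= tolerance)
    return agreeing >= min_fraction * len(pairs)


def range_from_disparity(disparity, interocular, focal_length):
    """Parallel-axis range: interocular * focal_length / disparity.

    All three lengths must share one unit. Zero disparity corresponds
    to a point at infinity.
    """
    if disparity == 0:
        return float("inf")
    return interocular * focal_length / disparity


if __name__ == "__main__":
    # Illustrative values only; they are not taken from the test system.
    left = [[10, 12], [20, 22]]
    right = [[11, 12], [25, 22]]
    print(areas_match(left, right, tolerance=1, min_fraction=0.75))
    print(range_from_disparity(2.0, interocular=200.0, focal_length=50.0))
```

In this sketch the disparity is a length on the image plane. A disparity counted in picture elements would first have to be multiplied by the element spacing, a quantity this section does not give.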

Our two main problems were, first, how to design this robot vision
so that it will reject "spurious" matches between areas of the left
and right views and, second, how to enable it to interpret part of a
scene that is ambiguous because only one camera can view it. The
method we devised of determining whether or not a match is
spurious is to assume that every surface is opaque and that it
therefore hides different areas of the background from each camera.
Spurious matches are eliminated by forming left-view and right-view
match spaces and comparing them.

This first approach was favored in the work reported here
because Julesz had performed experiments in which disparity had
been computed this way automatically. We repeated Julesz'
experiments with random gray-value dot patterns instead of the random
gray-value dot patterns he used. We defined a "match space" in
which we employed Julesz' methods both of removing spurious
matches and of filling ambiguous regions. To demonstrate the
results of this processing we devised the range map. Next,
employing random-square patterns, we devised means of eliminating
spurious surfaces and of reducing parallelepipeds of matches to
planes of matches. Finally, we aimed a TV camera at a Mars-like
scene from two positions, with the axes parallel between positions,
and fed the output of the camera through an analog-to-digital
converter into a computer, where the images from the camera in its
two positions were compared line by line. The result of matching,
eliminating spurious matches and eliminating ambiguities was again
a range map.

These steps, when followed by recognition of form, are models
of what Julesz calls local and global stereopsis. To take the next
step beyond the recognition of form, namely, the automatic
recognition of objects, it appears that the second approach is needed,
namely, one in which edges are detected and localized in space. In
higher vertebrate animals and in our second approach, edges are
detected prior to comparison between left and right views. In some
cases, however, it may be more efficient to detect edges after
comparison of left and right views, for example, by searching the
range map.

So that the above operations can be performed rapidly, we
have devised a means of shifting as many lines of the images at a
time as are needed in computation past a single time-shared
computing element. Matched points or edges can be plotted in a single
match space equipped with two sets of wiring to provide the two
views required of this space for the removal of ambiguities.

So that a stereo TV camera assembly can make a sequence of
observations of a scene, we have designed camera assemblies
variable in vergence*, pitch, roll and yaw. We have built the D1
assembly that has these properties. A stereo camera assembly,
we conclude, must employ both short and long focal-length lenses
in each camera, the short to provide the finder lenses, the long to
provide for identification of features. The one stereo TV camera
assembly we have both built and operated (Type C3b) has only
short focal-length lenses and is adjustable only in focus and vergence.

We have derived formulas for range accuracy and employed 
them to predict the effects of different focal lengths, interocular 
distances and sensor arrays. 

Features on the surfaces of objects and spatial relations among
these features, transmitted by a robot on Mars to an earth station,
can be reconstructed there into the appearance of objects. We have
constructed, from such data, the appearance of a cylinder in sunlight.


* Vergence denotes the convergence-divergence adjustment of
the two optic axes (Ref. 33).
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Note to Sections II G and II N 


APPENDIX A 

EQUATIONS FOR RANGE AND UNCERTAINTY IN RANGE 


A.1 The Geometry of Stereo TV Optics with Parallel Axes

Figure A-1 is a redrawing of Fig. 6a to show the uncertainty in
range measurement $\pm\Delta z$ corresponding to the uncertainty of point S on
the face of the camera tube. Positive uncertainty is $+\Delta z$, or PM;
negative uncertainty is $-\Delta z$, or PR. The angle of uncertainty $\delta$ is also
introduced in preparation for a discussion of the first strategy of II B.

In addition to the axis z, there is a parallel axis q in the x-z plane,
which bisects the interocular distance, 2b. The range, z, of P is
measured from the optical centers of the lenses, L, which are assumed
to behave like pinhole lenses.

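From the similarity of triangles POL and LAS in Fig. A-1,

$$\frac{z}{b} = \frac{f}{s}, \qquad z = \frac{bf}{s},$$

but in Fig. 6 the disparity between the two image positions is $2s$, so

$$z = \frac{bf}{s} = \frac{2bf}{2s}. \tag{A-1}$$

This is Eq. (1) on p. 19.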


Thus, for a system with parallel optical axes where range is measured
from the optical centers of the lenses, the nominal value of the range
is the product of the interocular distance and the focal length, divided
by the disparity between the positions of the image on the image plane.

A.2 Derivation of Equation of Range Uncertainty for Stereo TV
Cameras with Parallel Axes


If the uncertainty of the position S on the image plane is either $\Delta s$ or
$-\Delta s$, then the uncertainty in the range is $\Delta z$ or $-\Delta z$, respectively.

From Fig. A-1,

$$\frac{f}{s - \Delta s} = \frac{z + \Delta z}{b} \tag{A-2}$$

$$z + \Delta z = \frac{bf}{s - \Delta s} \tag{A-3}$$


Subtracting Eq. (A-1) from Eq. (A-3), we obtain the range uncertainty $\Delta z$
due to uncertainty in a position S on the image plane (or equivalently
an angular uncertainty $\delta$ in determining $\beta$):


$$\Delta z = \frac{bfs - bf(s - \Delta s)}{s(s - \Delta s)} = \frac{bf \cdot \Delta s}{s(s - \Delta s)} \tag{A-4}$$


Rearranging Eq. (A-1) to give $s = bf/z$ and substituting into Eq. (A-4),

$$\Delta z = \frac{\Delta s \cdot z^2}{bf - \Delta s \cdot z}$$

This is Eq. (2) on page 38.
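
As a numerical check on the derivation above, the closed form for the range uncertainty can be compared with the direct difference of Eq. (A-3) and Eq. (A-1). The following is a minimal sketch: the function names and the sample values are hypothetical choices for illustration, b is half the interocular distance as in Fig. A-1, and all lengths share one unit.

```python
def nominal_range(b, f, s):
    """Eq. (A-1): z = b * f / s for parallel optical axes."""
    return b * f / s


def range_uncertainty(b, f, z, delta_s):
    """Closed form: delta_z = delta_s * z**2 / (b * f - delta_s * z)."""
    margin = b * f - delta_s * z
    if margin <= 0:
        # Corresponds to s - delta_s <= 0, where Eq. (A-3) has no finite value.
        raise ValueError("delta_s * z must be smaller than b * f")
    return delta_s * z * z / margin


if __name__ == "__main__":
    # Illustrative values only, here taken to be in millimetres.
    b, f, s, delta_s = 105.0, 50.0, 0.5, 0.01
    z = nominal_range(b, f, s)
    closed_form = range_uncertainty(b, f, z, delta_s)
    direct = nominal_range(b, f, s - delta_s) - z  # Eq. (A-3) minus Eq. (A-1)
    print(z, closed_form, direct)
```

The two values of the uncertainty should agree to within rounding error, since the closed form is only an algebraic rearrangement of Eq. (A-4). Because the numerator grows as the square of z, the uncertainty rises rapidly with range for a fixed uncertainty on the image plane.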
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SYMBOL DEFINITIONS 

$z$ = RANGE OF A POINT P FROM O
A = POINT WHERE OPTIC AXIS PIERCES IMAGE PLANE
$s$ = DISTANCE ON THE IMAGE PLANE FROM POINT A TO IMAGE OF POINT P
$+\Delta z$ = PM = UNCERTAINTY IN RANGE CORRESPONDING TO UNCERTAINTY
IN THE POSITION OF POINT S, $\Delta s$
$-\Delta z$ = PR

Fig. A-1. Geometry of parallel optics for range finding.
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